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Abstract — Many real-world problems deal with collections of high-dimensional data, such as images, videos, text and web documents, DNA 
microarray data, and more. Often, such high-dimensional data lie close to low-dimensional structures corresponding to several classes or 
categories to which the data belong. In this paper, we propose and study an algorithm, called Sparse Subspace Clustering (SSC), to cluster 
data points that lie in a union of low-dimensional subspaces. The key idea is that, among the infinitely many possible representations of a data 
point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a 
sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces. Since 
solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions 
on the arrangement of the subspaces and the distribution of the data, the proposed minimization program succeeds in recovering the desired 
sparse representations. The proposed algorithm is efficient and can handle data points near the intersections of subspaces. Another key 
advantage of the proposed algorithm with respect to the state of the art is that it can deal directly with data nuisances, such as noise, 
sparse outlying entries, and missing entries, by incorporating the model of the data into the sparse optimization program. We demonstrate the 
effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation 
and face clustering. 



Index Terms — High-dimensional data, intrinsic low-dimensionality, subspaces, clustering, 
programming, spectral clustering, principal angles, motion segmentation, face clustering. 
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1 Introduction 

HIGH-DIMENSIONAL data are ubiquitous in many areas of 
machine learning, signal and image processing, computer 
vision, pattern recognition, bioinformatics, etc. For instance, 
images consist of billions of pixels, videos can have millions of 
frames, text and web documents are associated with hundreds of 
thousands of features, etc. The high-dimensionality of the data not 
only increases the computational time and memory requirements 
of algorithms, but also adversely affects their performance due to 
the noise effect and insufficient number of samples with respect 
to the ambient space dimension, commonly referred to as the 
"curse of dimensionality" |1|. However, high-dimensional data 
often lie in low-dimensional structures instead of being uniformly 
distributed across the ambient space. Recovering low-dimensional 
structures in the data helps to not only reduce the computational 
cost and memory requirements of algorithms, but also reduce 
the effect of high-dimensional noise in the data and improve the 
performance of inference, learning, and recognition tasks. 

In fact, in many problems, data in a class or category can 
be well represented by a low-dimensional subspace of the high- 
dimensional ambient space. For example, feature trajectories of 
a rigidly moving object in a video |2|, face images of a subject 
under varying illumination |3|, and multiple instances of a hand- 
written digit with different rotations, translations, and thicknesses 
f4 1 lie in a low-dimensional subspace of the ambient space. As a 
result, the collection of data from multiple classes or categories 
lie in a union of low-dimensional subspaces. Subspace clustering 
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(see f5\ and references therein) refers to the problem of separating 
data according to their underlying subspaces and finds numerous 
applications in image processing (e.g., image representation and 
compression |6|) and computer vision (e.g., image segmentation 
|7|, motion segmentation O, |[9l, and temporal video segmen- 
tation 1 10 1), as illustrated in Figures [T] and [2] Since data in 
a subspace are often distributed arbitrarily and not around a 
centroid, standard clustering methods |^ that take advantage 
of the spatial proximity of the data in each cluster are not in 
general applicable to subspace clustering. Therefore, there is a 
need for having clustering algorithms that take into account the 
multi-subspace structure of the data. 



1 .1 Prior Work on Subspace Clustering 

Existing algorithms can be divided into four main categories: iter- 
ative, algebraic, statistical, and spectral clustering-based methods. 

Iterative methods. Iterative approaches, such as K-subspaces 
(m, |[T3l and median K-flats flTl alternate between assigning 
points to subspaces and fitting a subspace to each cluster. The 
main drawbacks of such approaches are that they generally require 
to know the number and dimensions of the subspaces, and that 
they are sensitive to initialization. 

Algebraic approaches. Factorization-based algebraic approaches 
such as m, 121, ifTSl find an initial segmentation by thresholding 
the entries of a similarity matrix built from the factorization 
of the data matrix. These methods are provably correct when 
the subspaces are independent, but fail when this assumption 
is violated. In addition, they are sensitive to noise and outliers 
in the data. Algebraic-geometric approaches such as Generalized 
Principal Component Analysis (GPCA) HOl, (161, fit the data 
with a polynomial whose gradient at a point gives the normal 
vector to the subspace containing that point. While GPCA can 
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Fig. 1 . Motion segmentation: given feature points on multiple rigidly moving objects tracked in multiple frames of a video (top), the goal 
is to separate the feature trajectories according to the moving objects (bottom). 
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Fig. 2. Face clustering: given face images of multiple subjects (top), the goal is to find images that belong to the same subject (bottom). 



deal with subspaces of different dimensions, it is sensitive to noise 
and outliers, and its complexity increases exponentially in terms 
of the number and dimensions of subspaces. 

Statistical methods. Iterative statistical approaches, such as Mix- 
tures of Probabilistic PC A (MPPCA) (13, Multi-Stage Learning 
(MSL) fTSl, or |T9|, assume that the distribution of the data inside 
each subspace is Gaussian and alternate between data clustering 
and subspace estimation by applying Expectation Maximization 
(EM). The main drawbacks of these methods are that they gener- 
ally need to know the number and dimensions of the subspaces, 
and that they are sensitive to initialization. Robust statistical 
approaches, such as Random Sample Consensus (RANSAC) |20|, 
fit a subspace of dimension d to randomly chosen subsets of d 
points until the number of inliers is large enough. The inliers 
are then removed, and the process is repeated to find a second 
subspace, and so on. RANSAC can deal with noise and outliers, 
and does not need to know the number of subspaces. However, 
the dimensions of the subspaces must be known and equal. In 
addition, the complexity of the algorithm increases exponentially 
in the dimension of the subspaces. Information-theoretic statistical 
approaches, such as Agglomerative Lossy Compression (ALC) 
1211, look for the segmentation of the data that minimizes the 
coding length needed to fit the points with a mixture of degenerate 
Gaussians up to a given distortion. As this minimization problem 
is NP-hard, a suboptimal solution is found by first assuming that 
each point forms its own group, and then iteratively merging pairs 
of groups to reduce the coding length. ALC can handle noise 
and outliers in the data. While, in principle, it does not need to 
know the number and dimensions of the subspaces, the number 
of subspaces found by the algorithms is dependent on the choice 
of a distortion parameter. In addition, there is no theoretical proof 
for the optimality of the agglomerative algorithm. 

Spectral clustering-based methods. Local spectral clustering- 
based approaches such as Local Subspace Affinity (LSA) 11221 . 
Locally Linear Manifold Clustering (LLMC) |23 1, Spectral Local 
Best-fit Flats (SLBF) 1241 . and ll25ll use local information around 
each point to build a similarity between pairs of points. The 
segmentation of the data is then obtained by applying spectral 



clustering 1261 . 1271 to the similarity matrix. These methods 
have difficulties in dealing with points near the intersection of 
two subspaces, because the neighborhood of a point can contain 
points from different subspaces. In addition, they are sensitive to 
the right choice of the neighborhood size to compute the local 
information at each point. 

Global spectral clustering-based approaches try to resolve these 
issues by building better similarities between data points using 
global information. Spectral Curvature Clustering (SCC) |28 1 uses 
multi-way similarities that capture the curvature of a collection of 
points within an affine subspace. SCC can deal with noisy data 
but requires to know the number and dimensions of subspaces and 
assumes that subspaces have the same dimension. In addition, the 
complexity of building the multi-way similarity grows exponen- 
tially with the dimensions of the subspaces, hence, in practice, 
a sampling strategy is employed to reduce the computational 
cost. Using advances in sparse f29l, |30|, |31| and low-rank 
1321 . 1331 . l34l recovery algorithms. Sparse Subspace Clustering 
(SSC) Ea, m, 113, Low-Rank Recovery (LRR) EH, EH, 
f40l, and Low-Rank Subspace Clustering (LRSC) pTi algorithms 
pose the clustering problem as one of finding a sparse or low-rank 
representation of the data in the dictionary of the data itself. The 
solution of the corresponding global optimization algorithm is 
then used to build a similarity graph from which the segmentation 
of the data is obtained. The advantages of these methods with 
respect to most state-of-the-art algorithms are that they can handle 
noise and outliers in data, and that they do not need to know the 
dimensions and, in principle, the number of subspaces a priori. 

1 .2 Paper Contributions 

In this paper, we propose and study an algorithm based on 
sparse representation techniques, called Sparse Subspace Cluster- 
ing (SSC), to cluster a collection of data points lying in a union 
of low-dimensional subspaces. The underlying idea behind the 
algorithm is what we call the self-expressiveness property of the 
data, which states that each data point in a union of subspaces can 
be efficiently represented as a linear or affine combination of other 
points. Such a representation is not unique in general because 
there are infinitely many ways in which a data point can be 



expressed as a combination of other points. The key observation 
is that a sparse representation of a data point ideally corresponds 
to a combination of a few points from its own subspace. This 
motivates solving a global sparse optimization program whose 
solution is used in a spectral clustering framework to infer the 
clustering of data. As a result, we can overcome the problems 
of local spectral clustering-based algorithms, such as choosing 
the right neighborhood size and dealing with points near the 
intersection of subspaces, since, for a given data point, the sparse 
optimization program automatically picks a few other points that 
are not necessarily close to it but belong to the same subspace. 

Since solving the sparse optimization program is in general 
NP-hard, we consider its li relaxation. We show that, under mild 
conditions on the arrangement of subspaces and data distribution, 
the proposed ^i -minimization program recovers the desired solu- 
tion, guaranteeing the success of the algorithm. Our theoretical 
analysis extends the sparse representation theory to the multi- 
subspace setting where the number of points in a subspace is 
arbitrary, possibly much larger than its dimension. Unlike block- 
sparse recovery problems (IS, (431, O, ll45J, ||46l, 143 where 
the bases for the subspaces are known and given, we do not have 
the bases for subspaces nor do we know which data points belong 
to which subspace, making our case more challenging. We only 
have the sparsifying dictionary for the union of subspaces given 
by the matrix of data points. 

The proposed ^i -minimization program can be solved effi- 
ciently using convex programming tools (481, (49l ^ EDl and does 
not require initialization. Our algorithm can directly deal with 
noise, sparse outlying entries, and missing entries in the data as 
well as the more general class of affine subspaces by incorporating 
the data corruption or subspace model into the sparse optimization 
program. Finally, through experimental results, we show that 
our algorithm outperforms state-of-the-art subspace clustering 
methods on the two real-world problems of motion segmentation 
(Fig. [T]) and face clustering (Fig. [2]). 

Paper Organization. In Section [2] we motivate and introduce 
the SSC algorithm for clustering data points in a union of linear 
subspaces. In Section [3] we generalize the algorithm to deal with 
noise, sparse outlying entries, and missing entries in the data as 
well as the more general class of affine subspaces. In Section 
[4j we investigate theoretical conditions under which the li- 
minimization program recovers the desired sparse representations 
of data points. In Section [5] we discuss the connectivity of the 
similarity graph and propose a regularization term to increase the 
connectivity of points in each subspace. In Section [6j we verify 
our theoretical analysis through experiments on synthetic data. In 
Section [T] we compare the performance of SSC with the state of 
the art on the two real-world problems of motion segmentation 
and face clustering. Finally, Section [8] concludes the paper. 

2 Sparse Subspace Clustering 

In this section, we introduce the sparse subspace clustering (SSC) 
algorithm for clustering a collection of multi- subspace data using 
sparse representation techniques. We motivate and formulate the 
algorithm for data points that perfectly lie in a union of linear 
subspaces. In the next section, we will generalize the algorithm 
to deal with data nuisances such as noise, sparse outlying entries, 
and missing entries as well as the more general class of affine 
subspaces. 



Let {Si}^^^ be an arrangement of n linear subspaces of R^ 
of dimensions {di}]l^^. Consider a given collection of N noise- 
free data points {y^j^i that lie in the union of the n subspaces. 
Denote the matrix containing all the data points as 



Y^[V, ... y^,] = [Yi 



^„]r, 



(1) 



where Y i G R^^^^ is a rank-d^ matrix of the Ni > di points 
that lie in Si and T G R^^^ is an unknown permutation matrix. 
We assume that we do not know a priori the bases of the subspaces 
nor do we know which data points belong to which subspace. The 
subspace clustering problem refers to the problem of finding the 
number of subspaces, their dimensions, a basis for each subspace, 
and the segmentation of the data from Y . 

To address the subspace clustering problem, we propose an 
algorithm that consists of two steps. In the first step, for each data 
point, we find a few other points that belong to the same subspace. 
To do so, we propose a global sparse optimization program 
whose solution encodes information about the memberships of 
data points to the underlying subspace of each point. In the second 
step, we use these information in a spectral clustering framework 
to infer the clustering of the data. 

2.1 Sparse Optimization Program 

Our proposed algorithm takes advantage of what we refer to as 
the self-expressiveness property of the data, i.e., 

each data point in a union of subspaces can be efficiently re- 
constructed by a combination of other points in the dataset. 

More precisely, each data point for data point y^ G U^^^^S^ can 
be written as 

y^ = Yci, Cii = 0, (2) 

where c^ = [cn Ci2 ... c^n] and the constraint ca = 
eliminates the trivial solution of writing a point as a linear 
combination of itself. In other words, the matrix of data points 
1^ is a self-expressive dictionary in which each point can be 
written as a linear combination of other points. However, the 
representation of y^ in the dictionary Y is not unique in general. 
This comes from the fact that the number of data points in a 
subspace is often greater than its dimension, i.e., N£ > d£. As a. 
result, each Yi, and consequently Y, has a non-trivial nullspace 
giving rise to infinitely many representations of each data point. 
The key observation in our proposed algorithm is that among 
all solutions of ([2]), 

there exists a sparse solution, Ci, whose nonzero entries 
correspond to data points from the same subspace as y^. We 
refer to such a solution as a sub space -sparse representation. 

More specifically, a data point y^ that lies in the d^ -dimensional 
subspace S^ can be written as a linear combination of d^ other 
points in general directions from 5^. As a result, ideally, a sparse 
representation of a data point finds points from the same subspace 
where the number of the nonzero elements corresponds to the 
dimension of the underlying subspace. 

For a system of equations such as ^ with infinitely many 
solutions, one can restrict the set of solutions by minimizing an 
objective function such as the ^g-norm of the solution^ as 

Yd, Q, =0. (3) 



s.t. 
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Fig. 3. Three subspaces in R^ with 10 data points in each subspace, ordered such that the fist and the last 10 points belong to <Si and 
S3, respectively. The solution of the ^g-minimization program in ^ for y- lying in *Si for ^ = 1, 2, 00 is shown. Note that as the value of q 
decreases, the sparsity of the solution increases. For q=l, the solution corresponds to choosing two other points lying in Si. 



Different choices of q have different effects in the obtained 
solution. Typically, by decreasing the value of q from infinity 
toward zero, the sparsity of the solution increases, as shown in 
Figure [3] The extreme case ofq = corresponds to the general 
NP-hard problem |51| of finding the sparsest representation of 
the given point, as the ^o-norm counts the number of nonzero 
elements of the solution. Since we are interested in efficiently 
finding a non-trivial sparse representation of y^ in the dictionary 
Y, we consider minimizing the tightest convex relaxation of the 
^o-norm, i.e.. 



mm \\Cj 



s.t. y^ = Yci, Cii = 0, 



(4) 



which can be solved efficiently using convex programming tools 
1481 . II49II . ISOl and is known to prefer sparse solutions |i29J, l.30,L 

EB. 

We can also rewrite the sparse optimization program ^ for all 
data points z = 1 , . . . , A/" in matrix form as 



min||C||i s.t. Y = YC, diag(C) = 0, 



(5) 



where C = [ci C2 ... Cn\ G M is the matrix whose 

i-th column corresponds to the sparse representation of y^, Ci, 
and diag(C) G M^ is the vector of the diagonal elements of C. 
Ideally, the solution of ^ corresponds to subspace-sparse 
representations of the data points, which we use next to infer the 
clustering of the data. In Section |4j we study conditions under 
which the convex optimization program in ^ is guaranteed to 
recover a subspace-sparse representation of each data point. 

2.2 Clustering using Sparse Coefficients 

After solving the proposed optimization program in ([5]), we 
obtain a sparse representation for each data point whose nonzero 
elements ideally correspond to points from the same subspace. 
The next step of the algorithm is to infer the segmentation of the 
data into different subspaces using the sparse coefficients. 

To address this problem, we build a weighted graph Q = 
{VjS^W), where V denotes the set of N nodes of the graph 
corresponding to N data points and E CV xV denotes the set of 
edges between nodes. W G R^^^ is a symmetric non-negative 
similarity matrix representing the weights of the edges, i.e., node 
i is connected to node j by an edge whose weight is equal to Wij . 
An ideal similarity matrix W, hence an ideal similarity graph Q, 
is one in which nodes that correspond to points from the same 
subspace are connected to each other and there are no edges 
between nodes that correspond to points in different subspaces. 

Note that the sparse optimization program ideally recovers to a 
subspace-sparse representation of each point, i.e., a representation 



whose nonzero elements correspond to points from the same 
subspace of the given data point. This provides an immediate 
choice of the similarity matrix as W = \C\ + |C|^. In other 
words, each node i connects itself to a node j by an edge whose 
weight is equal to \cij \-\-\cji\. The reason for the symmetrization 
is that, in general, a data point y^ G Si can write itself as a 
linear combination of some points including y^ G Si. However, 
yj may not necessarily choose y^ in its sparse representation. By 
this particular choice of the weight, we make sure that nodes i 
and j get connected to each other if either y^ or yj is in the 
sparse representation of the other|^ 

The similarity graph built this way has ideally n connected 
components corresponding to the n subspaces, i.e.. 
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Wr. 



r, 



(6) 



where Wi is the similarity matrix of data points in S^. Clustering 
of data into subspaces follows then by applying spectral clustering 
1 26] to the graph Q. More specifically, we obtain the clustering 
of data by applying the Kmeans algorithm ifTTl to the normalized 
rows of a matrix whose columns are the n bottom eigenvectors 
of the symmetric normalized Laplacian matrix of the graph. 

Remark 1: An optional step prior to building the similarity 
graph is to normalize the sparse coefficients as q ^ Ci/||ci||oo. 
This helps to better deal with different norms of data points. 
More specifically, if a data point with a large Euclidean norm 
selects a few points with small Euclidean norms, then the values 
of the nonzero coefficients will generally be large. On the other 
hand, if a data point with a small Euclidean norm selects a few 
points with large Euclidean norms, then the values of the nonzero 
coefficients will generally be small. Since spectral clustering puts 
more emphasis on keeping the stronger connections in the graph, 
by the normalization step we make sure that the largest edge 
weights for all the nodes are of the same scale. 

Algorithm [T] summarizes the SSC algorithm. Note that an 
advantage of spectral clustering, which will be shown in the 
experimental results, is that it provides robustness with respect 
to a few errors in the sparse representations of the data points. 
In other words, as long as edges between points in different 
subspaces are weak, spectral clustering can find the correct 
segmentation. 

2. To obtain a symmetric similarity matrix, one can directly impose the 
constraint of C = C~^ in the optimization program. However, this results in 
increasing the complexity of the optimization program and, in practice, does not 
perform better than the post-symmetrization of C, as described above. See also 
i52i for other processing approaches of the similarity matrix. 



Algorithm 1 : Sparse Subspace Clustering (SSC) 

Input: A set of points {y^j^i lying in a union of n linear 
subspaces {Si}'^^i. 

1: Solve the sparse optimization program ^ in the case of 
uncorrupted data or ([13]) in the case of corrupted data. 

2: Normalize the columns of C as c^ ^ -n-^^ — . 

\\Ci\\oo 

3: Form a similarity graph with N nodes representing the data 
points. Set the weights on the edges between the nodes by 



4: Apply spectral clustering f26\ to the similarity graph. 
Output: Segmentation of the data: l^i, 1^2, • • • , ^n- 



Remark 2: In principle, SSC does not need to know the 
number of subspaces. More specifically, under the conditions of 
the theoretical results in Section |4j in the similarity graph there 
will be no connections between points in different subspaces. 
Thus, one can determine the number of subspaces by finding the 
number of graph components, which can be obtained by analyzing 
the eigenspectrum of the Laplacian matrix of Q |27|. However, 
when there are connections between points in different subspaces, 
other model selection techniques should be employed |53j. 

3 Practical Extensions 

In real- world problems, data are often corrupted by noise and 
sparse outlying entries due to measurement/process noise and ad- 
hoc data collection techniques. In such cases, the data do not lie 
perfectly in a union of subspaces. For instance, in the motion 
segmentation problem, because of the malfunctioning of the 
tracker, feature trajectories can be corrupted by noise or can have 
entries with large errors |21 1. Similarly, in clustering of human 
faces, images can be corrupted by errors due to specularities, cast 
shadows, and occlusions (541 . On the other hand, data points may 
have missing entries, e.g., when the tracker loses track of some 
feature points in a video due to occlusions |55 1. Finally, data may 
lie in a union of affine subspaces, a more general model which 
includes linear subspaces as a particular case. 

In this section, we generalize the SSC algorithm for clustering 
data lying perfectly in a union of linear subspaces, to deal with 
the aforementioned challenges. Unlike state-of-the-art methods, 
which require to run a separate algorithm first to correct the errors 
in the data 1211 . l55l , we deal with these problems in a unified 
framework by incorporating a model for the corruption into the 
sparse optimization program. Thus, the sparse coefficients again 
encode information about memberships of data to subspaces, 
which are used in a spectral clustering framework, as before. 

3.1 Noise and Sparse Outlying Entries 

In this section, we consider clustering of data points that are 
contaminated with sparse outlying entries and noise. Let 



yi = yi 



(7) 



be the i-th data point that is obtained by corrupting an error- 
free point y^, which perfectly lies in a subspace, with a vector of 
sparse outlying entries e^ G M^ that has only a few large nonzero 
elements, i.e., ||e^||o < k for some integer k, and with a noise 
z^ e M^ whose norm is bounded as ||2;^||2 < C for some C > 0- 
Since error-free data points perfectly lie in a union of subspaces. 



using the self-expressiveness property, we can reconstruct y^ G 
Si in terms of other error-free points as 



Vi 



E 



^ijVj' 



(8) 



Note that the above equation has a sparse solution since y^ can 
be expressed as a linear combination of at most d^ other points 
from S^. Rewriting y^ using ^ in terms of the corrupted point 
y^, the sparse outlying entries vector e^, and the noise vector z^ 
and substituting it into ([5]), we obtain 

2/i = X] Cijyj ^Ci^Zi, (9) 

where the vectors e, G M^ and Zj G M^ are defined as 



^e? 



A 

= z^ 



E, 
■E 






(10) 

(11) 



Since ^ has a sparse solution q, e^ and Zi also correspond to 
vectors of sparse outlying entries and noise, respectively. More 
precisely, when a few Cij are nonzero, e^ is a vector of sparse 
outlying entries since it is a linear combination of a few vectors 
of outlying entries in ([TO]). Similarly, when a few cij are nonzero 
and do not have significantly large magnitudeq^ zi is a vector of 
noise since it is linear combination of a few noise vectors in ( pT| ). 
Collecting e^ and zi as columns of the matrices E and Z, 
respectively, we can rewrite ^ in matrix form as 



Y = YC^E^Z, diag(C) = 0. 



(12) 



Our objective is then to find a solution {C^E^Z) for (V2\ , where 
C corresponds to a sparse coefficient matrix, E corresponds to 
a matrix of sparse outlying entries, and Z is a noise matrix. To 
do so, we propose to solve the following optimization program 



min ||C||i+Ae||^||i + 



A., 



(13) 



s.t. Y = YC^E^Z, diag(C)=0, 



where the £i-norm promotes sparsity of the columns of C and 
E, while the Frobenius norm promotes having small entries in 
the columns of Z. The two parameters Ag > and A^ > 
balance the three terms in the objective function. Note that 
the optimization program in ([13]) is convex with respect to the 
optimization variables (C, E, Z), hence, can be solved efficiently 
using convex programming tools. 

When data are corrupted only by noise, we can eliminate E 
from the optimization program in ([13]). On the other hand, when 
the data are corrupted only by sparse outlying entries, we can 
eliminate Z in ( [T3] ). In practice, however, E can also deal with 
small errors due to noise. The following proposition suggests 
setting Xz = o^z/l-^z and Ag = Q^e/^e. where az^o^e > 1 and 



fiz = minmax|i/7?/7 



lie = mmmax||i/||i 



(14) 



The proofs of all theoretical results in the paper are provided in 
the supplementary material. 

Proposition 1: Consider the optimization program ( [T3] ). With- 
out the term Z, if Ag < l//ie, then there exists at least 

3. One can show that, under broad conditions, sum of \cij\ is bounded above 
by the square root of the dimension of the underlying subspace of y^. Theoretical 
guarantees of the proposed optimization program in the case of corrupted data is 
the subject of the current research. 



one data point y^ for which in the optimal solution we have 

(q, e^) = (0, i/^). Also, without the term E, if A^ < 1/ jiz^ then 
there exists at least one data point y^ for which (q, Z() = (0, y^). 

After solving the proposed optimization programs, we use C 
to build a similarity graph and infer the clustering of data using 
spectral clustering. Thus, by incorporating the corruption model 
of data into the sparse optimization program, we can deal with 
clustering of corrupted data, as before, without explicitly running 
a separate algorithm to correct the errors in the data [21il . [i55il . 

3.2 Missing Entries 

We consider now the clustering of incomplete data, where some 
of the entries of a subset of the data points are missing. Note that 
when only a small fraction of the entries of each data point is 
missing, clustering of incomplete data can be cast as clustering of 
data with sparse outlying entries. More precisely, one can fill in 
the missing entries of each data point with random values, hence 
obtain data points with sparse outlying entries. Then clustering of 
the data follows by solving ([13]) and applying spectral clustering 
to the graph built using the obtained sparse coefficients. However, 
the drawback of this approach is that it disregards the fact that 
we know the locations of the missing entries in the data matrix. 
It is possible, in some cases, to cast the clustering of data 
with missing entries as clustering of complete data. To see this, 
consider a collection of data points {y^j^i in R^. Let Ji C 
{1, . . . , I)} denote indices of the known entries of y^ and define 
J = ni=i Ji- Thus, for every index in J, all data points have 
known entries. When the size of J, denoted by | J|, is not small 
relative to the ambient space dimension, D, we can project the 
data, hence, the original subspaces, into a subspace spanned by 
the columns of the identity matrix indexed by J and apply the 
SSC algorithm to the obtained complete data. In other words, 
we can only keep the rows of Y indexed by J, obtain a new 
data matrix of complete data Y e RI'^I^^, and solve the sparse 
optimization program ([13]). We can then infer the clustering of the 
data by applying spectral clustering to the graph built using the 
sparse coefficient matrix. Note that the approach described above 
is based on the assumption that J is nonempty. Addressing the 
problem of subspace clustering with missing entries when J is 
empty or has a small size is the subject of the future research. 

3.3 Affine Subspaces 

In some real- world problems, the data lie in a union of affine 
rather than linear subspaces. For instance, the motion segmen- 
tation problem involves clustering of data that lie in a union of 
3-dimensional affine subspaces O, ||5? I. A naive way to deal 
with this case is to ignore the affine structure of the data and 
perform clustering as in the case of linear subspaces. This comes 
from the fact that a d^ -dimensional affine subspace S^ can be 
considered as a subset of a {d£ -\- 1) -dimensional linear subspace 
that includes Si and the origin. However, this has the drawback 
of possibly increasing the dimension of the intersection of two 
subspaces, which in some cases can result in indistinguishability 
of subspaces from each other. For example, two different lines 
X = —1 and X = +1 in the x-y plane form the same 2- 
dimensional linear subspace after including the origin, hence 
become indistinguishable. 

To directly deal with affine subspaces, we use the fact that any 
data point y^ in an affine subspace Si of dimension di can be 
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Fig. 4. Left: the three 1-dimensional subspaces are independent 
as they span the 3-dimensional space and the sum of their dimen- 
sions is also 3. Right: the three 1-dimensional are disjoint as any 
two subspaces intersect at the origin. 



written as an affine combination of di -\- 1 other points from S^. 
In other words, a sparse solution of 



y^ = Yci, I'^Ci = 1, Cii = 0, 



(15) 



corresponds to di -\- 1 other points that belong to Si containing 
y^. Thus, to cluster data points lying close to a union of affine 
subspaces, we propose to solve the sparse optimization program 



(16) 



min ||C||i+Ae||^|| 

s.t. Y = YC^E^Z, 1^C = 1^, diag(C) = 0, 

which, in comparison to ( [T3] ) for the case of linear subspaces, 
includes additional linear equality constraints. Note that ([16]) can 
deal with linear subspaces as well since a linear subspace is also 
an affine subspace. 

4 Subspace-Sparse Recovery Theory 

The underlying assumption for the success of the SSC algorithm 
is that the proposed optimization program recovers a subspace- 
sparse representation of each data point, i.e., a representation 
whose nonzero elements correspond to the subspace of the given 
point. In this section, we investigate conditions under which, for 
data points that lie in a union of linear subspaces, the sparse 
optimization program in (|4]) recovers subspace-sparse represen- 
tations of data points. We investigate recovery conditions for 
two classes of subspace arrangements: independent and disjoint 
subspace models |36|. 

Definition 1: A collection of subspaces {Si}'^^^ is said to be 
independent if d\m{®'^^^Si) = ^^^-^ dim(5i), where denotes 
the direct sum operator 

As an example, the three 1-dimensional subspaces shown in 
Figure |4] (left) are independent since they span a 3-dimensional 
space and the sum of their dimensions is also 3. On the other 
hand, the subspaces shown in Figure [4] (right) are not independent 
since they span a 2 -dimensional space while the sum of their 
dimensions is 3. 

Definition 2: A collection of subspaces {Si}^^^ is said to be 
disjoint if every pair of subspaces intersect only at the origin. In 
other words, for every pair of subspaces we have dmi{Si®Sj) = 
dim(5i) + dim(5j). 

As an example, both subspace arrangements shown in Figure [4] are 
disjoint since each pair of subspaces intersect at the origin. Note 
that, based on the above definitions, the notion of disjointness 
is weaker than independence as an independent subspace model 
is always disjoint while the converse is not necessarily true. An 



important notion that can be used to characterize two disjoint 
subspaces is the smallest principal angle, defined as follows. 

Definition 3: The smallest principal angle between two sub- 
spaces Si and Sj, denoted by Oij, is defined as 



cos {6 ij) = max 



vjv^ 



v,eSi,vjeSj \\Vi\\2\\Vj\\2 



(17) 



Note that two disjoint subspaces intersect at the origin, hence their 
smallest principal angle is greater than zero and cos{Oij) G [0, 1). 

4.1 Independent Subspace Model 

In this section, we consider data points that lie in a union of 
independent subspaces, which is the underlying model of many 
subspace clustering algorithms. We show that the ^i -minimization 
program in (|4]) and more generally the ^^ -minimization in ([3]) for 
q < oo always recover subspace-sparse representations of the data 
points. More specifically, we show the following result. 

Theorem 1: Consider a collection of data points drawn from 
n independent subspaces {Si}'^^i of dimensions {di}^^-^. Let Yi 
denote Ni data points in Si, where rank(l^i) = di, and let Y-i 
denote data points in all subspaces except Si. Then, for every Si 
and every nonzero y in Si, the £q-minimization program 



argmm 



s.t. y = [Yi Y_i] 



, (18) 



for q < oo, recovers a subspace-sparse representation, i.e., c* ^ 
and c*_ = 0. 

Note that the subspace-sparse recovery holds without any as- 
sumption on the distribution of the data points in each subspace, 
other than rank(l^i) = di. This comes at the price of having a 
more restrictive model for the subspace arrangements. Next, we 
will show that for the more general class of disjoint subspaces, 
under appropriate conditions on the relative configuration of 
the subspaces as well as the distribution of the data in each 
subspace, the £i -minimization in Q recovers subspace-sparse 
representations of the data points. 

4.2 Disjoint Subspace Model 

We consider now the more general class of disjoint subspaces and 
investigate conditions under which the optimization program in 
(|4]) recovers a subspace-sparse representation of each data point. 
To that end, we consider a vector x in the intersection of Si with 
®j^iSj and let the optimal solution of the ^i -minimization when 
we restrict the dictionary to data points from Si be 



Cbi 



argmin||a||i s.t. 



X 



Y,a. 



(19) 



We also let the optimal solution of the ^i -minimization when we 
restrict the dictionary to points from all subspaces except Si be 



argmin||a||i s.t. 



-Q 



(20) 



We show in the supplementary material that the SSC algorithm 
succeeds in recovering subspace-sparse representations of data 
points in each Si, if for every nonzero x in the intersection of 



Si with ^j^iSj, the ^i-norm of the solution of ([19]) is strictly 
smaller than the ^i-norm of the solution of ([2Q|), i.e.. 



\/x eSiH {®j^iSj),x ^ 



ai 



< a_ 



(21) 



More precisely, we show the following result. 

Theorem 2: Consider a collection of data points drawn from 
n disjoint subspaces {Si}^^^ of dimensions {di}^^-^. Let Yi 
denote Ni data points in Si, where rank(l^^) = di, and let Y-i 
denote data points in all subspaces except Si. The ii-minimization 



argmm 



s.t. y = [Yi Y_i] 



(22) 



recovers a subspace-sparse representation of every nonzero y in 
Si, i.e., c* 7^ and cl = 0, if and only if ^T\) holds. 

While the necessary and sufficient condition in ^T\) guarantees 
a successful subspace-sparse recovery via the ^i -minimization 
program, it does not explicitly show the relationship between the 
subspace arrangements and the data distribution for the success of 
the ^1 -minimization program. To establish such a relationship, we 
show that II a^ 111 < Pi, where Pi depends on the singular values 
of data points in Si, and /3_^ < ||a_^||i, where /3_i depends on 
the subspace angles between Si and other subspaces. Then, the 
sufficient condition /3i < P-i establishes the relationship between 
the subspace angles and the data distribution under which the £i- 
minimization is successful in subspace-sparse recovery, since it 
implies that 

||a,||i<ft</3_, <||a_,||i, (23) 

i.e., the condition of Theorem [2] holds. 

Theorem 3: Consider a collection of data points drawn from 
n disjoint subspaces {Si}^^^ of dimensions {di}^^^. Let W^ be 
the set of all full- rank submatrices Yi G R^^^'^* ofYi, where 
rank(l^^) = di. If the condition 

max cFd^Yi) > ^/d'i\\Y-i\\l^2 maxcos(6>i_^-) (24) 



Yi^ 



j^i 



holds, then for every nonzero y in Si, the ii -minimization in ([22]) 
recovers a subspace-sparse solution, i.e., c* ^ and cl = OPT 

Loosely speaking, the sufficient condition in Theorem |3] states 
that if the smallest principal angle between each Si and any other 
subspace is larger than a certain value that depends on the data 
distribution in Si, then the subspace-sparse recovery holds. This 
bound can be rather high when the norms of the data points are 
oddly distributed, e.g., when the maximum norm of data points 
in Si is much smaller than the maximum norm of data points in 
all other subspaces. Since the segmentation of the data does not 
change when data points are scaled, we can apply SSC to linear 
subspaces after normalizing the data points to have unit Euclidean 
norms. In this case, the sufficient condition in ([241) reduces to 

(25) 



max adi{Yi) > \fdi maxcos(^ij). 



Remark 3: For independent subspaces, the intersection of a 
subspace with the direct sum of other subspaces is the origin, 
hence, the condition in ( [2T] ) always holds. As a result, from 
Theorem |2j the ^i -minimization always recovers subspace-sparse 
representations of data points in independent subspaces. 



4. In fact, ai and a-i depend on a?, Y i, and Y -%. Since this dependence is 5. The induced norm H'K- 

clear from the context, we drop the arguments in aiix.Y i) and a_^(a3,V_i). oiY -i. 



1 1,2 denotes the maximum -^2 -norm of the columns 






Fig. 5. Left: for any nonzero x in the intersection of Si and & © <S3, the polytope aVi reaches x for a smaller a than aV-i, hence, 
subspace-sparse recovery holds. Middle: when the subspace angle decreases, the polytope aV-i reaches x for a smaller a than aVi. 
Right: when the distribution of the data in <Si becomes nearly degenerate, in this case close to a line, the polytope aV-i reaches x for a 
smaller a than aVi. In both cases, in the middle and right, the subspace-sparse recovery does not hold for points at the intersecion. 



Remark 4: The condition in ^T\} is closely related to the 
nuUspace property in the sparse recovery literature 1561 , |57]| , 
EH, EEl- The key difference, however, is that we only require 
the inequality in pT] ) to hold for the optimal solutions of ([19]) and 
( [2Q| ) instead of any feasible solution. Thus, while the inequality 
can be violated for many feasible solutions, it can still hold for 
the optimal solutions, guaranteeing successful subspace-sparse 
recovery from Theorem |2] Thus, our result can be thought of as 
a generalization of the nullspace property to the multi- subspace 
setting where the number of points in each subspace is arbitrary. 

4.3 Geometric interpretation 

In this section, we provide a geometric interpretation of the 
subspace-sparse recovery conditions in ( [2T] ) and ([24]). To do so, 
it is necessary to recall the relationship between the ^i-norm of 
the optimal solution of 



min||a||i s.t. 



Ba, 



(26) 



and the symmetrized convex polytope of the columns of B 11591 . 
More precisely, if we denote the columns of B by bi and define 
the symmetrized convex hull of the columns of B by 



V = conv(±6i,±62,- • ■), 



(27) 



then the ^i-norm of the optimal solution of ( [261 ) corresponds to 
the smallest a > such that the scaled polytope aV reaches x 
1591 . Let us denote the symmetrized convex polytopes of Yi and 
Y-ihy Vi and V-i, respectively. Then the condition in ( [2T] ) has 
the following geometric interpretation: 

the subspace-sparse recovery in Si holds if and only if for 
any nonzero x in the intersection of Si and ^j^iSj, aVi 
reaches x before aV-i, i.e., for a smaller a. 

As shown in the left plot of Figure [5] for x in the intersection 
of <Si and ^2 ^3, the polytope aVi reaches x before aV-i, 
hence the subspace-sparse recovery condition holds. On the other 
hand, when the principal angles between Si and other subspaces 
decrease, as shown in the middle plot of Figure [5] the subspace- 
sparse recovery condition does not hold since the polytope aV-i 
reaches x before aVi. Also, as shown in the right plot of Figure[5] 
when the distribution of the data in Si becomes nearly degenerate, 
in this case close to a 1 -dimensional subspace orthogonal to 
the direction of x, then the subspace-sparse recovery condition 
does not hold since aV-i reaches x before aVi. Note that the 
sufficient condition in ( [24] ) translates the relationship between the 
polytopes, mentioned above, explicitly in terms of a relationship 
between the subspace angles and the singular values of the data. 



■ ..J k- _■, 



Fig. 6. Coefficient matrix obtained from the solution of ([30| for data 
points in two subspaces. Left: \r = 0. Right: \r = 10. Increasing 
\r results in concentration of the nonzero elements in a few rows of 
the coefficient matrix, hence choosing a few common data points. 



5 Graph Connectivity 

In the previous section, we studied conditions under which 
the proposed ^1 -minimization program recovers subspace-sparse 
representations of data points. As a result, in the similarity graph, 
the points that lie in different subspaces do not get connected to 
each other. On the other hand, our extensive experimental results 
on synthetic and real data show that data points in the same 
subspace always form a connected component of the graph, hence, 
for n subspaces the similarity graph has n connected components. 
|60| has theoretically verified the connectivity of points in the 
same subspace for 2 and 3 dimensional subspaces. However, it 
has shown that, for subspaces of dimensions greater than or equal 
to 4, under odd distribution of the data, it is possible that points 
in the same subspace form multiple components of the graph. 

In this section, we consider a regularization term in the sparse 
optimization program that promotes connectivity of the points 
within each subspace|jWe use the idea that if data points in each 
subspace choose a few common points from the same subspace in 
their sparse representations, then they form a single component 
of the similarity graph. Thus, we add to the sparse optimization 
program the regularization term 



|r,0 



N 



i(iicii2>o), 



(28) 



where I(-) denotes the indicator function and c* denotes the z-th 
row of C . Hence, minimizing ( [28] ) corresponds to minimizing the 
number of nonzero rows of C 16T1 , 1621 , 1631 , i.e., choosing a few 
common data points in the sparse representation of each point. 

6. Another approach to deal with the connectivity issue is to analyze the sub- 
spaces corresponding to the components of the graph and merge the components 
whose associated subspaces have a small distance from each other, i.e., have a 
small principal angle. However, the result can be sensitive to the choice of the 
dimension of the subspaces to fit to each component as well as the threshold value 
on the principal angles to merge the subspaces. 
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Fig. 7. Left: three 1-dimensional subspaces in R^ with normalized data points. IVIiddle: Ci corresponds to the solution of ([30]) for \r == 0. 
The similarity graph of Ci has three components corresponding to the three subspaces. Right: C2 corresponds to the solution of ([30} for 
\r -^ +00 and G (0, ff ). The similarity graph of C2 has only one connected component. 



Since a minimization problem that involves ( [28] ) is in general 
NP-hard, we consider its convex relaxation as 



Subspace-Sparse Recovery Error 



Subspace Clustering Error 



|r,l 



N 

"El 



|C1|2. 



(29) 



Thus, to increase the connectivity of data points from the same 
subspace in the similarity graph, we propose to solve 

min||C||i+A^||C||^,i s.t. Y = YC, diag(C) = 0, (30) 

where A^ > sets the trade-off between the sparsity of the 
solution and the connectivity of the graph. Figure [6] shows how 
adding this regularization term promotes selecting common points 
in sparse representations. The following example demonstrates the 
reason for using the row- sparsity term as a regularizer but not as 
an objective function instead of the ^i-norm. 

Example 1: Consider the three 1-dimensional subspaces in 
M? , shown in Figure |7J where the data points have unit Euclidean 
norms and the angle between Si and ^2 as well as between Si and 
^3 is equal to 9. Note that in this example, the sufficient condition 
in ( [24] ) holds for all values of 6> G (0, |). As a result, the solution 
of ( [30] ) with Ar = recovers a subspace-sparse representation for 
each data point, which in this example is uniquely given by Ci 
shown in Figure [7] Hence, the similarity graph has exactly 3 
connected components corresponding to the data points in each 
subspace. Another feasible solution of ( [30] ) is given by C2, shown 
in Figure [7] where the points in Si choose points from ^2 and ^3 
in their representations. Hence, the similarity graph has only one 
connected component. Note that for a large range of subspace 
angles 6> G (0, ^) we have 



||C2||.,i = v/l6 + 2/cos2(^) < ||Ci||,,i = 6. (31) 

As a result, for large values of A^, i.e., when we only minimize the 
second term of the objective function in ( [3Q| , we cannot recover 
subspace-sparse representations of the data points. This suggests 
using the row-sparsity regularizer with a small value of A^. 

6 Experiments with Synthetic Data 

In Section [4] we showed that the success of the £1 -minimization 
for subspace-sparse recovery depends on the principal angles 
between subspaces and the distribution of the data in each 
subspace. In this section, we verify this relationship through 
experiments on synthetic data. 

We consider three disjoint subspaces {Si}^^^ of the same 
dimension d embedded in the /^-dimensional ambient space. To 
make the problem hard enough so that every data point in a sub- 
space can also be reconstructed as a linear combination of points 
in other subspaces, we generate subspace bases {Ui G M^^^jf^^ 
such that each subspace lies in the direct sum of the other two 




Fig. 8. Subspace-sparse recovery error (left) and subspace 
clustering error (right) for three disjoint subspaces. Increasing the 
number of points or smallest principal angle decreases the errors. 



subspaces, i.e., rank([t/i U2 t/3]) = 2d. In addition, we 
generate the subspaces such that the smallest principal angles 
9i2 and 6^23 are equal to 6. Thus, we can verify the effect of 
the smallest principal angle in the subspace-sparse recovery by 
changing the value of 0. To investigate the effect of the data 
distribution in the subspace-sparse recovery, we generate the same 
number of data points, Ng, in each subspace at random and 
change the value of Ng . Typically, as the number of data points 
in a subspace increases, the probability of the data being close to 
a degenerate subspace decreases Q 

After generating three d-dimensional subspaces associated to 
{O^Ng), we solve the ^1 -minimization program in ^ for each 
data point and measure two different errors. First, denoting the 
sparse representation of y^ G Sk^ by cj = [cj^ cj c^], 
with Cij corresponding to points in Sj , we measure the subspace- 
sparse recovery error by 



ssr error : 



3A^, 






l^i/cJIl 



)e[0,l], (32) 



where each term inside the summation indicates the fraction of 
the ^i-norm of Ci that comes from points in other subspaces. The 
error being zero corresponds to y^ choosing points only in its 
own subspace, while the error being equal to one corresponds to 
y^ choosing points from other subspaces. Second, after building 
the similarity graph using the sparse coefficients and applying 
spectral clustering, we measure the subspace clustering error by 

# of misclassified points 



subspace clustering error = 



total # of points 



(33) 



In our experiments, we set the dimension of the ambient space 
to D = 50. We change the smallest principal angle between 
subspaces as 6> G [6, 60] degrees and change the number of points 
in each subspace sls Ng e [d-\-l^ 32d]. For each pair (6>, Ng) we 

7. To remove the effect of different scalings of data points, i.e., to consider 
only the effect of the principal angle and number of points, we normalize the 
data points. 
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Fig. 9. Left: percentage of pairs of subspaces whose smallest prin- 
cipal angle is smaller than a given value. Right: average percentage 
of data points in pairs of subspaces that have one or more of their 
iC-nearest neighbors in the other subspace. 



Hopkins 155 Dataset 




Extended YaleB Dataset 



Fig. 10. Left: singular values of several motions in the Hop- 
kins 155 dataset. Each motion corresponds to a subspace of 
dimension at most 4. Right: singular values of several faces in the 
Extended Yale B dataset. Each subject corresponds to a subspace 
of dimension around 9. 



compute the average of the errors in ( |32| ) and p3\ over 100 trials 
(randomly generated subspaces and data points). The results for 
(i = 4 are shown in Figure M Note that when either or Ng is 
small, both the subspace-sparse recovery error and the clustering 
error are large, as predicted by our theoretical analysis. On the 
other hand, when or Ng increases, the errors decrease, and for 
{O^Ng) sufficiently large we obtain zero errors. The results also 
verify that the success of the clustering relies on the success of the 
^1 -minimization in recovering subspace-sparse representations of 
data points. Note that for small 6 as we increase Ng, the subspace- 
sparse recovery error is large and slightly decreases, while the 
clustering error increases. This is due to the fact that increasing 
the number of points, the number of undesirable edges between 
different subspaces in the similarity graph increases, making the 
spectral clustering more difficult. Note also that, for the values 
of (O^Ng) where the subspace-sparse recovery error is zero, i.e., 
points in different subspaces are not connected to each other in the 
similarity graph, the clustering error is also zero. This implies that, 
in such cases, the similarity graph has exactly three connected 
components, i.e., data points in the same subspace form a single 
component of the graph. 

7 Experiments with Real Data 

In this section, we evaluate the performance of the SSC algorithm 
in dealing with two real- world problems: segmenting multiple 
motions in videos (Fig. [T]) and clustering images of human faces 
(Fig. [2]). We compare the performance of SSC with the best state- 
of-the-art subspace clustering algorithms: LSA 1221 . SCC ||28]| , 
LRR |i38J, and LRSC |41|. 



Implementation details. We implement the SSC optimization 
algorithm in (T3\ using an Alternating Direction Method of 



Multipliers (ADMM) framework ISOl , (641 whose derivation is 
provided in the supplementary material. For the motion segmen- 
tation experiments, we use the noisy variation of the optimization 
program ([13]), i.e., without the term E, with the affine constraint, 
and choose A^ = 800///^ in all the experiments (/i^ is defined 
in ([14])). For the face clustering experiments, we use the sparse 
outlying entries variation of the optimization program ([TSj, i.e., 
without the term Z, and choose Ag = 20//ie in all the experiments 
(/ie is defined in ([14])). It is also worth mentioning that SSC 
performs better with the ADMM approach than with general 
interior point solvers |49|, which typically return many small 
nonzero coefficients, degrading the spectral clustering result. 

For the state-of-the-art algorithms, we use the codes provided 
by their authors. For LSA, we use K = S nearest neighbors and 
dimension d = 4, to fit local subspaces, for motion segmentation 
and use K = 7 nearest neighbors and dimension d = 5 for 
face clustering. For SCC, we use dimension d = 3, for the 
subspaces, for motion segmentation and (i = 9 for face clustering. 
For LRR, we use A = 4 for motion segmentation and A = 0.18 
for face clustering. Note that the LRR algorithm according to 
(381, similar to SSC, applies spectral clustering to a similarity 
graph built directly from the solution of its proposed optimization 
program. However, the code of the algorithm applies a heuristic 
post-processing step, similar to |65J, to the low-rank solution 
prior to building the similarity graph ll40l . Thus, to compare 
the effectiveness of sparse versus low-rank objective function 
and to investigate the effect of the post-processing step of LRR, 
we report the results for both cases of without (LRR) and with 
(LRR-H) the heuristic post-processing step|j For LRSC, we use 
the method in |41, Lemma 1] with parameter r = 420 for 
motion segmentation, and an ALM variant of the method in BTl 
Section 4.2] with parameters a = 3r = 0.5 * (1.25/cri(l^))^, 
7 = 0.008 and p = 1.5 for face clustering. Finally, as LSA and 
SCC need to know the number of subspaces a priori and the 
estimation of the number of subspaces from the eigenspectrum 
of the graph Laplacian in the noisy setting is often unreliable, 
in order to have a fair comparison, we provide the number of 
subspaces as an input to all the algorithms. 

Datasets and some statistics. For the motion segmentation prob- 
lem, we consider the Hopkins 155 dataset |66|, which consists 
of 155 video sequences of 2 or 3 motions corresponding to 2 
or 3 low-dimensional subspaces in each video f2l, f67|. For the 
face clustering problem, we consider the Extended Yale B dataset 
1 68], which consists of face images of 38 human subjects, where 
images of each subject lie in a low-dimensional subspace |3|. 

Before describing each problem in detail and presenting the 
experimental results, we present some statistics on the two 
datasets that help to better understand the challenges of subspace 
clustering and the performance of different algorithms. First, we 
compute the smallest principal angle for each pair of subspaces, 
which in the motion segmentation problem corresponds to a pair 
of motions in a video and in the face clustering problem corre- 
sponds to a pair of subjects. Then, we compute the percentage 
of the subspace pairs whose smallest principal angle is below 
a certain value, which ranges from to 90 degrees. Figure [9] 
(left) shows the corresponding graphs for the two datasets. As 

8. The original published code of LRR contains the function "compacc.m" 
for computing the misclassification rate, which is erroneous. We have used the 
correct code for computing the misclassification rate and as a result, the reported 
performance for LRR-H is different from the published results in 13 8 J and [40 J . 



TABLE 1 

Clustering error (%) of different algorithms on the Hopkins 155 dataset with 
the 2F-dimensional data points. 



TABLE 2 

Clustering error (%) of different algorithms on the Hopkins 155 dataset with 
the 4n-dimensional data points obtained by applying PCA. 



Algorithms 
2 Motions 


LSA 1 sec 1 LRR 


LRR-H 


LRSC 


ssc 




Algorithms 
2 Motions 


LSA 1 sec 1 LRR 


LRR-H 


LRSC 


SSC 


Mean 
Median 


4.23 
0.56 


2.89 
0.00 


4.10 
0.22 


2.13 
0.00 


3.69 
0.29 


1.52 (2.07) 
0.00 (0.00) 




Mean 
Median 


3.61 
0.51 


3.04 
0.00 


4.83 
0.26 


3.41 
0.00 


3.87 
0.26 


1.83(2.14) 
0.00 (0.00) 
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3 Motions 










Mean 
Median 


7.02 
1.45 


8.25 
0.24 


9.89 
6.22 


4.03 

1.43 


7.69 
3.80 


4.40(5.27) 
0.56(0.40) 




Mean 
Median 


7.65 
1.27 


7.91 
1.14 


9.89 
6.22 


4.86 
1.47 


7.72 
3.80 


4.40(5.29) 
0.56 (0.40) 


All 












All 










Mean 
Median 


4.86 
0.89 


4.10 
0.00 


5.41 
0.53 


2.56 
0.00 


4.59 
0.60 


2.18 (2.79) 
0.00 (0.00) 




Mean 
Median 


4.52 
0.57 


4.14 
0.00 


5.98 
0.59 


3.74 
0.00 


4.74 

0.58 


2.41 (2.85) 
0.00 (0.00) 



shown, subspaces in both datasets have relatively small principal 
angles. In the Hopkins- 155 dataset, principal angles between 
subspaces are always smaller than 10 degrees, while in the 
Extended Yale B dataset, principal angles between subspaces are 
between 10 and 20 degrees. Second, for each pair of subspaces, 
we compute the percentage of data points that have one or more of 
their iC-nearest neighbors in the other subspace. Figure [9] (right) 
shows the average percentages over all possible pairs of subspaces 
in each dataset. As shown, in the Hopkins- 155 dataset for almost 
all data points, their few nearest neighbors belong to the same 
subspace. On the other hand, for the Extended Yale B dataset, 
there is a relatively large number of data points whose nearest 
neighbors come from the other subspace. This percentage rapidly 
increases as the number of nearest neighbors increases. As a 
result, from the two plots in Figure |9] we can conclude that in the 
Hopkins 155 dataset the challenge is that subspaces have small 
principal angles, while in the Extended Yale B dataset, beside the 
principal angles between subspaces being small, the challenge is 
that data points in a subspace are very close to other subspaces. 

7.1 Motion Segmentation 

Motion segmentation refers to the problem of segmenting a 
video sequence of multiple rigidly moving objects into multiple 
spatiotemporal regions that correspond to different motions in 
the scene (Fig.[T]). This problem is often solved by extracting and 
tracking a set of N feature points {xfi G M^}^^ through the 
frames / = 1, . . . , F of the video. Each data point y^, which is 
also called a feature trajectory, corresponds to a 2F-dimensional 
vector obtained by stacking the feature points Xfi in the video as 



ATT T 



3)2F 



'^Fi\ 



(34) 



Motion segmentation refers to the problem of separating these 
feature trajectories according to their underlying motions. Under 
the affine projection model, all feature trajectories associated with 
a single rigid motion lie in an affine subspace of R^^ of dimen- 
sion at most 3, or equivalently lie in a linear subspace of R^^ of 
dimension at most 4 |2|, |67|. Therefore, feature trajectories of 
n rigid motions lie in a union of n low-dimensional subspaces of 
R^^. Hence, motion segmentation reduces to clustering of data 
points in a union of subspaces. 

In this section, we evaluate the performance of the SSC 
algorithm as well as that of state-of-the-art subspace clustering 
methods for the problem of motion segmentation. To do so, we 
consider the Hopkins 155 dataset 1661 that consists of 155 video 
sequences, where 120 of the videos have two motions and 35 of 
the videos have three motions. On average, in the dataset, each 



sequence of 2 motions has N = 266 feature trajectories and 
F = 30 frames, while each sequence of 3 motions has N = 398 
feature trajectories and F = 29 frames. The left plot of Figure 
10 shows the singular values of several motions in the dataset. 
Note that the first four singular values are nonzero and the rest 
of the singular values are very close to zero, corroborating the 4- 
dimensionality of the underlying linear subspace of each motion |j 
In addition, it shows that the feature trajectories of each video can 
be well modeled as data points that almost perfectly lie in a union 
of linear subspaces of dimension at most 4. 

The results of applying subspace clustering algorithms to 
the dataset when we use the original 2F-dimensional feature 
trajectories and when we project the data into a 4n-dimensional 
subspace (n is the number of subspaces) using PCA are shown 
in Table [T] and Table |2] respectively. From the results, we make 
the following conclusions: 

- In both cases, SSC obtains a small clustering error outper- 
forming the other algorithms. This suggests that the separation of 
different motion subspaces in terms of their principal angles and 
the distribution of the feature trajectories in each motion subspace 
are sufficient for the success of the sparse optimization program, 
hence clustering. The numbers inside parentheses show the clus- 
tering errors of SSC without normalizing the similarity matrix, 
i.e., without step 2 in Algorithm [T] Notice that, as explained in 
Remark [T] the normalization step helps to improve the clustering 
results. However, this improvement is small (about 0.5%), i.e., 
SSC performs well with or without the post-processing of C. 

- Without post-processing of its coefficient matrix, LRR has 
higher errors than other algorithms. On the other hand, post- 
processing of the low-rank coefficient matrix significantly im- 
proves the clustering performance (LRR-H). 

- LRSC tries to find a noise-free dictionary for data while finding 
their low-rank representation. This helps to improve over LRR. 
Also, note that the errors of LRSC are higher than the reported 
ones in ll4T1l . This comes from the fact that BTIl has used the 
erroneous compacc.m function from [32] to compute the errors. 

- The clustering performances of different algorithms when using 
the 2F-dimensional feature trajectories or the 4n-dimensional 
PCA projections are close. This comes from the fact that the 
feature trajectories of n motions in a video almost perfectly 
lie in a 4n-dimensional linear subspace of the 2F-dimensional 
ambient space. Thus, projection using PCA onto a 4n-dimensional 
subspace preserves the structure of the subspaces and the data, 

9. If we subtract the mean of the data points in each motion from them, the 
singular values drop at 3, showing the 3 -dimensionality of the affine subspaces. 
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Fig. 1 1 . Clustering error (%) of SSC as a function of az in tine 
regularization parameter \z = az/iiz for tine two cases of clustering 
of 2F-dimensional data and 4n-dimensional data obtained by PCA. 



hence, for each algorithm, the clustering error in Table [T] is close 
to the error in Table O 

In Figure [TT] we show the effect of the regularization parameter 
^z = (^z/f^z in the clustering performance of SSC over the entire 
Hopkins 155 dataset. Note that the clustering errors of SSC as 
a function of az follow a similar pattern using both the 2F- 
dimensional data and the 4n-dimensional data. Moreover, in both 
cases the clustering error is less than 2.5% in both cases for a 
large range of values of az . 

Finally, notice that the results of SSC in Tables [T][2] do not 
coincide with those reported in |35|. This is mainly due to the fact 
in 1351 we used random projections for dimensionality reduction, 
while here we use PCA or the original 2F-dimensional data. In 
addition, in fsH we used a CVX solver to compute a subspace- 
sparse representation, while here we use an ADMM solver. Also, 
notice that we have improved the overall clustering error of LSA, 
for the the case of 4n-dimensional data, from 4.94%, reported in 
1661 , 1351 , to 4.52%. This is due to using K = S nearest neighbors 
here instead of i^ = 5 in 1661. 



7.2 Face Clustering 

Given face images of multiple subjects, acquired with a fixed pose 
and varying illumination, we consider the problem of clustering 
images according to their subjects (Fig. |2]). It has been shown 
that, under the Lambertian assumption, images of a subject with 
a fixed pose and varying illumination lie close to a linear subspace 
of dimension 9 |3|. Thus, the collection of face images of multiple 
subjects lie close to a union of 9-dimensional subspaces. 

In this section, we evaluate the clustering performance of SSC 
as well as the state-of-the-art methods on the Extended Yale B 
dataset f68l. The dataset consists of 192 x 168 pixel cropped face 
images of n = 38 individuals, where there are Ni = 64 frontal 
face images for each subject acquired under various lighting 
conditions. To reduce the computational cost and the memory 
requirements of all algorithms, we downsample the images to 
48 X 42 pixels and treat each 2, 016-dimensional vectorized image 
as a data point, hence, D = 2, 016. The right plot in Figure 10 
shows the singular values of data points of several subjects in the 
dataset. Note that the singular value curve has a knee around 9, 
corroborating the approximate 9 -dimensionality of the face data 
in each subject. In addition, the singular values gradually decay 
to zero, showing that the data are corrupted by errors. Thus, the 
face images of n subjects can be modeled as corrupted data points 
lying close to a union of 9-dimensional subspaces. 



To study the effect of the number of subjects in the clustering 
performance of different algorithms, we devise the following 
experimental setting: we divide the 38 subjects into 4 groups, 
where the first three groups correspond to subjects 1 to 10, 11 to 
20, 21 to 30, and the fourth group corresponds to subjects 31 to 
38. For each of the first three groups we consider all choices of 
n e {2,3,5,8,10} subjects and for the last group we consider 
all choices of n G {2, 3, 5, 8}p| Finally, we apply clustering 
algorithms for each trial, i.e., each set of n subjects. 

7.2. 1 Applying RPCA separately on each subject 

As shown by the SVD plot of the face data in Figure [T0| (right), 
the face images do not perfectly lie in a linear subspace as they 
are corrupted by errors. In fact, the errors correspond to the 
cast shadows and specularities in the face images and can be 
modeled as sparse outlying entries. As a result, it is important 
for a subspace clustering algorithm to effectively deal with data 
with sparse corruptions. 

In order to validate the fact that corruption of faces is due 
to sparse outlying errors and show the importance of dealing 
with corruptions while clustering, we start by the following 
experiment. We apply the Robust Principal Component Analysis 
(RPCA) algorithm 13 2 J to remove the sparse outlying entries 
of the face data in each subject. Note that in practice, we do 
not know the clustering of the data beforehand, hence cannot 
apply the RPCA to the faces of each subject. However, as we 
will show, this experiment illustrates some of the challenges of 
the face clustering and validates several conclusions about the 
performances of different algorithms. 

Table [3] shows the clustering error of different algorithms after 
applying RPCA to the data points in each subject and removing 
the sparse outlying entries, i.e., after bringing the data points back 
to their low-dimensional subspaces. From the results, we make 
the following conclusions: 

- The clustering error of SSC is very close to zero for different 
number of subjects suggesting that SSC can deal well with face 
clustering if the face images are corruption free. In other words, 
while the data in different subspaces are very close to each other, 
as shown in Figure [9] (right), the performance of the SSC is 
more dependent on the principal angles between subspaces which, 
while small, are large enough for the success of SSC. 

- The LRR and LRSC algorithms have also low clustering errors 
(LRSC obtains zero errors) showing the effectiveness of removing 
sparse outliers in the clustering performance. On the other hand, 
while LRR-H has a low clustering error for 2, 3, and 5 subjects, it 
has a relatively large error for 8 and 10 subjects, showing that the 
post processing step on the obtained low-rank coefficient matrix 
not always improves the result of LRR. 

- For LSA and SCC, the clustering error is relatively large and the 
error increases as the number of subjects increases. This comes 
from the fact that, as shown in Figure [9] (right), for face images, 
the neighborhood of each data point contains points that belong 
to other subjects and, in addition, the number of neighbors from 
other subjects increases as we increase the number of subjects. 



10. Note that choosing n out of 38 leads to extremely large number of trials. 
Thus, we have devised the above setting in order to have a repeatable experiment 
with a reasonably large number of trials for each n. 



TABLE 3 

Clustering error (%) of different algorithms on the Extended Yale B dataset 
after applying RPCA separately to the data points in each subject. 



TABLE 5 

Clustering error (%) of different algorithms on the Extended Yale B dataset 
without pre-processing the data. 



Algorithm LSA 

2 Subjects 


sec 


LRR 


LRR-H 


LRSC 


ssc 




Algorithm LSA 

2 Subjects 


sec 


LRR 


LRR-H 


LRSC 


SSC 


Mean 
Median 


6.15 
0.00 


1.29 
0.00 


0.09 
0.00 


0.05 
0.00 


0.00 
0.00 


0.06 
0.00 




Mean 
Median 


32.80 
47.66 


16.62 

7.82 


9.52 
5.47 


2.54 

0.78 


5.32 
4.69 


1.86 
0.00 


3 Subjects 














3 Subjects 












Mean 
Median 


11.67 
2.60 


19.33 

8.59 


0.12 
0.00 


0.10 
0.00 


0.00 
0.00 


0.08 
0.00 




Mean 
Median 


52.29 
50.00 


38.16 
39.06 


19.52 
14.58 


4.21 
2.60 


8.47 
7.81 


3.10 
1.04 


5 Subjects 














5 Subjects 












Mean 
Median 


21.08 
19.21 


47.53 
47.19 


0.16 
0.00 


0.15 
0.00 


0.00 
0.00 


0.07 
0.00 




Mean 
Median 


58.02 
56.87 


58.90 
59.38 


34.16 
35.00 


6.90 
5.63 


12.24 
11.25 


4.31 
2.50 


8 Subjects 














8 Subjects 












Mean 
Median 


30.04 
29.00 


64.20 
63.77 


4.50 
0.20 


11.57 
15.43 


0.00 
0.00 


0.06 
0.00 




Mean 
Median 


59.19 
58.59 


66.11 
64.65 


41.19 
43.75 


14.34 
10.06 


23.72 
28.03 


5.85 
4.49 


10 Subjects 














10 Subjects 












Mean 
Median 


35.31 
30.16 


63.80 
64.84 


0.15 
0.00 


13.02 
13.13 


0.00 
0.00 


0.89 
0.31 




Mean 
Median 


60.42 
57.50 


73.02 

75.78 


38.85 
41.09 


22.92 
23.59 


30.36 

28.75 


10.94 
5.63 



TABLE 4 

Clustering error (%) of different algorithms on the Extended Yale B dataset 

after applying RPCA simultaneously to all the data in each trial. 

I Algorithm | LSA | SCC | LRR | LRR-H | LRSC | SSC^ 
2 Subjects 



Mean 
Median 


32.53 
47.66 


9.29 
7.03 


7.27 
6.25 


5.72 
3.91 


5.67 
4.69 


2.09 

0.78 



3 Subjects 



Mean 
Median 


53.02 
51.04 


32.00 
37.50 


12.29 
11.98 


10.01 
9.38 


8.72 
8.33 


3.77 
2.60 



5 Subjects 



Mean 
Median 



58.76 
56.87 



53.05 
51.25 



19.92 
19.38 



15.33 
15.94 



10.99 
10.94 



6.79 
5.31 



8 Subjects 



Mean 
Median 



62.32 
62.50 



66.27 
64.84 



31.39 
33.30 



28.67 
31.05 



16.14 
14.65 



10.28 

9.57 



10 Subjects 



Mean 
Median 



62.40 
62.50 



63.07 
60.31 



35.89 
34.06 



32.55 
30.00 



21.82 
25.00 



11.46 
11.09 



7.2.2 Applying RPCA simultaneously on all subjects 

In practice, we cannot apply RPCA separately to the data in each 
subject because the clustering is unknown. In this section, we deal 
with sparse outlying entries in the data by applying the RPCA 
algorithm to the collection of all data points for each trial prior 
to clustering. The results are shown in Table |4] from which we 
make the following conclusions: 

- The clustering error for SSC is low for all different number 
of subjects. Specifically, SSC obtains 2.09% and 11.46% for 
clustering of data points in 2 and 10 subjects, respectively. 

- Applying RPCA to all data points simultaneously may not be 
as effective as applying RPCA to data points in each subject 
separately. This comes from the fact that RPCA tends to bring the 
data points into a common low-rank subspace which can result in 
decreasing the principal angles between subspaces and decreasing 
the distances between data points in different subjects. This 
can explain the increase in the clustering error of all clustering 
algorithms with respect to the results in Table [3] 

7.2. 3 Using original data points 

Finally, we apply the clustering algorithms to the original data 
points without pre-processing the data. The results are shown in 
Table [5] from which we make the following conclusions: 



- The SSC algorithm obtains a low clustering error for all 
numbers of subjects, obtaining 1.86% and 10.94% clustering error 
for 2 and 10 subjects, respectively. In fact, the error is smaller than 
when applying RPCA to all data points. This is due to the fact 
that SSC directly incorporates the corruption model of the data 
by sparse outlying entries into the sparse optimization program, 
giving it the ability to perform clustering on the corrupted data. 

- While LRR also has a regularization term to deal with the 
corrupted data, the clustering error is relatively large especially 
as the number of subjects increases. This can be due to the fact 
that there is not a clear relationship between corruption of each 
data point and the LRR regularization term in general |38|. On 
the other hand, the post processing step of LRR-H on the low- 
rank coefficient matrix helps to significantly reduce the clustering 
error, although it is larger than the SSC error. 

- As LRSC tries to recover error-free data points while finding 
their low-rank representation, it obtains smaller errors than LRR. 

- LSA and SCC do not have an explicit way to deal with corrupted 
data. This together with the fact that the face images of each 
subject have relatively a large number of neighbors in other 
subjects, as shown in Figure |9] (right), result in low performances 
of these algorithms. 

7. 2. 4 Computational time comparison 

The average computational time of each algorithm as a function 
of the number of subjects (or equivalently the number of data 
points) is shown in Figure [12] Note that the computational time 
of SCC is drastically higher than other algorithms. This comes 
from the fact that the complexity of SCC increases exponentially 
in the dimension of the subspaces, which in this case is d = 9. 
On the other hand, SSC, LRR and LRSC use fast and efficient 
convex optimization techniques which keeps their computational 
time lower than other algorithms. The exact computational times 
are provided in the supplementary materials. 

8 Conclusions and Future Work 

We studied the problem of clustering a collection of data points 
that lie in or close to a union of low-dimensional subspaces. 
We proposed a subspace clustering algorithm based on sparse 
representation techniques, called SSC, that finds a sparse rep- 
resentation of each point in the dictionary of the other points. 
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Number of subjects 



Fig. 1 2. Average computational time (sec.) of tine algorithms on the 
Extended Yale B dataset as a function of the number of subjects. 

builds a similarity graph using the sparse coefficients, and ob- 
tains the segmentation of the data using spectral clustering. We 
showed that, under appropriate conditions on the arrangement of 
subspaces and the distribution of data, the algorithm succeeds 
in recovering the desired sparse representations of data points. 
A key advantage of the algorithm is its ability to directly deal 
with data nuisances, such as noise, sparse outlying entries, and 
missing entries as well as the more general class of affine 
subspaces by incorporating the corresponding models into the 
sparse optimization program. Experiments on real data such as 
face images and motions in videos showed the effectiveness of 
our algorithm and its superiority over the state of the art. 

Interesting avenues of research, which we are currently investi- 
gating, include theoretical analysis of the subspace-sparse recov- 
ery in the presence of noise, sparse outlying entries, and missing 
entries in the data. As our extensive experiments on synthetic and 
real data show, the points in each subspace, in general, form a 
single component of the similarity graph. Theoretical analysis of 
the connectivity of the similarity graph for points in the same 
subspace in a probabilistic framework would provide a better 
understanding for this observation. Finally, making the two steps 
of solving a sparse optimization program and spectral clustering 
applicable to very large datasets is an interesting and a practical 
subject for the future work. 
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Appendix 

Proof of Proposition[1] 

In this section, we prove the resuU of Proposition [T] in the paper 
regarding the optimization program 



min ||C||i+Ae||^|| 



A., 



z\\l 



s.t. Y = YC^E^Z, diag(C)=0. 



(35) 



The result of the proposition suggests to set the regularization 
parameters as 



Ae = OLe/ lie, ^z = OLzjl^z, 

where o^e , Q^^ > 1 and /ig and /i^ are defined as 



Me 



mmmax||i/||i, 



l^z 



mm max \y^ yA 



(36) 



(37) 



We use the following Lemma in the theoretical proof of the 
proposition. Proof of this Lemma can be found in B9ll . 

Lemma 1: Consider the optimization program 



min ||c||i + -||2/-Ac| 



(38) 



For A < ||A^y||oo, we have c = 0. 

Proposition [if Consider the optimization program ( [35] ). With- 
out the term Z, if Ag < l//ie, then there exists at least 
one data point y^ for which in the optimal solution we have 
{ci^Ci) = (0, i/^). Also, without the term E, if A^ < l/zi^, then 
there exists at least one data point y^ for which (q, z^) = (0, y^). 

Proof: Note that solving the optimization program ( [35] ) is 
equivalent to solving N optimization programs as 



min ||ci||i + Aelle^l 



A. 



^i|l2 



2 " ^"^ (39) 

s.t. yi = Yci^ei^ Zi, q^ = 0, 

where c^, e^, and Zi are the i-th columns of C, E, and Z, 
repsectively. 

(a) Consider the optimization program ( [39] ) without the term Zi 
and denote the objective function value by 



cost(ci,ei) = ||ci||i + Ae||ei||i. 



(40) 



Note that a feasible solution of ( [39] ) is given by (0, e^) for which 
the value of the objective function is equal to 



cost(0,ei) = Aelli/Jli. 



(41) 



On the other hand, using matrix norm properties, for any feasible 
solution (ci,ei) of (|39l> we have 



WiWi 



ll'ci + eilli <(max||yj|i)||ci||i 



j¥» 



ei 1, 



(42) 



where we used the fact that cu = 0. Muhiplying both sides of 
the above inequaUty by Ag we obtain 

cost(0,yi) = Ae||F||i < (Aemax||y.||i) ||ci||i + Ae||ei||i, 

(43) 
Note that if Ae<;^^^;^ 
have 



^ — u- , then from the above equation we 

cost(0, yj < cost(ci, Ci). (44) 
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In other words, (c^ = 0, e^ = y^) achieve the minimum 
cost among all feasible solutions of ([39]). Hence, if Ag < 

^^^^i ^g^^. "^11 .11 , then there exists £ G {1, • • • , N} such that in 
the solution of the optimization program ( [35] ) we have (q, e^) = 

(0,?/,). 

(b) Consider the optimization program ( [39] ) without the term e^, 
which, using Zi = y^ — Yci, can be rewritten as 



mm 



l|C, 1 



A. 



IVi 



y Ci\\2 S. t. Q^ 



0. 



(45) 
, the 



From Lemma [T] we have that, for A^ < ] — t — r 

solution of ( [45] ) is equal to c^ = 0, or equivalently, the so- 
lution of ( [39] ) is given by {ci^zi) = (0,i/J. As a result, if 



Xz < max^ 



7, then there exists £ e {1, • ' ' ^ ^} such 



that in the solution of the optimization program ( [35] ) we have 

{ci.Zi) = (0,2/^). 

n 



Proof of TheoremH] 

In this section, we prove Theorem [T] in the paper, where we 
showed that for data points in a union of independent subspaces, 
the solution of the ^^ -minimization recovers subspace-sparse 
representations of data points. 

Theorem [If Consider a collection of data points drawn from 
n independent subspaces {Si}^^^ of dimensions {di}^^^. Let Yi 
denote Ni data points in Si, where rank(l^i) = di, and let Y_i 
denote data points in all subspaces except S^. Then, for every Si 
and every nonzero y in Si, the £q-minimization program 



argmm 



s.t. y = [Y, Y_i] 



(46) 



for q < oo, recovers a subspace-sparse representation, i.e., c* ^ 
and c*_ = 0. 

Proof: We prove the result using contradiction. Assume 
c^ 7^ 0. Then we can write 



y = Y.c" + Y_ 



(47) 



Since y is a data point in subspace Si, there exists a c such that 
y = Y iC. Substituting this into ( [47] ) we get 



Y,{^c- 



(48) 



Note that the left hand side of equation ( [48] ) corresponds to a point 
in the subspace Si while the right hand side of ( [48] ) corresponds to 
a point in the subspace ^j^iSj. By the independence assumption, 
the two subspaces Si and ^j^iSj are also independent hence 
disjoint and intersect only at the origin. Thus, from ( [48] ) we must 
have Y_iC^ = and from ( [47] ) we obtain y = YiC*. In other 

words, [c*^ 0^] is a feasible solution of the optimization 
problem ([46]). Finally, from the assumption of cl y^ 0, we have 



(49) 



\c*' 

[o 


< 

1 


"c*l 
c*_\ 



-" q 



that contradicts the optimality of [c*^ cl] . Thus, we must 
have c* 7^ and cl = 0, obtaining the desired result. □ 



Proof of Theorem[2] 

In this section, we prove Theorem [2] in the paper, where we 
provide a necessary and sufficient condition for subspace-sparse 
recovery in a union of disjoint subspaces. To do so, we consider a 
vector X in the intersection of Si with ®j^iSj and let the optimal 
solution of the ^i -minimization when we restrict the dictionary 
to the points from Si be 



a^ = argmin ||a||i s.t. 



(50) 



We also let the optimal solution of the £i minimization when we 
restrict the dictionary to the points from all subspaces except Si 
be 

s. t. X — Y -i a. (51) 



a_ 



argmm ||a||i 

We show that if for every nonzero x in the intersection of Si with 
^j^iSj, the ^i-norm of the solution of ( [50] ) is strictly smaller than 



the ^i-norm of the solution of ( [5T] ), i.e., 

^x eSiD {®j^iSj),x ^ 



di 



< ll«-^lll, 



(52) 



then the SSC algorithm succeeds in recovering subspace-sparse 
representations of all the data points in Si. 

Theorem [2j Consider a collection of data points drawn from n 
disjoint subspaces {Si}^^i of dimensions {di}^^^. Let Yi denote 
Ni data points in Si, where rank(l^^) = di, and let Y_i denote 
data points in all subspaces except Si. The ii -minimization 



argmm 



s.t. y = [Y, Y_i] 



(53) 



recovers a subspace-sparse representation of every nonzero y in 
Si, i.e., c* 7^ and cl = 0, if and only if ( [52] ) holds. 

Proof: {^=) We prove the result using contradiction. As- 
sume cl 7^ and define 



y 



Yic" 



(54) 



Since y lies in Si and Y id" is a linear combination of points in 
Si, from the first equality in ( [54] ) we have that x is a vector in 
Si. Let a, be the solution of ([50l for x. We have 



y 



X iC X iCli 



y = Yi{c*+ai). 



(55) 



On the other hand, since Y-iC*_ is a linear combination of points 
in all subspaces except Si, from the second equality in ( [54] ) we 
have that a? is a vector in ^j^iSj. Let a_i be the solution of 
( [5T] ) for X. We have 



y 



YiC* 



(56) 



Note that the left hand side of ( [56] ) together with the fact that 
a_^ is the optimal solution of ( [5T] ) imply that 



\a-i\\i < lie! 



From ( [55] ) and ( [56] ) we have that 



ai 







and 



a_ 



feasible solutions of the original optimization program in" 
Thus, we have 



(57) 
are 

m. 







<||c*||i + ||a,||i<||c*||i + ||a_,||i< 



(58) 

where the first inequality follows from triangle inequality, the 
second strict inequality follows from the sufficient condition in 



([52]), and the last inequality follows from ( [57] ). This contradicts the 
optimality of [c*^ c^] for the original optimization program 
in ( [53] ), hence proving the desired result. 

{^=) We prove the result using contradiction. Assume the 
condition in ([52]) does not hold, i.e., there exists a nonzero 
X in the intersection of Si and ®j^iSj for which we have 
||o^-i||i < ll^illi- As a result, for i/ = x, a solution of the li- 
minimization program ( [53] ) corresponds to selecting points from 
all subspaces except Si, which contradicts the subspace-sparse 
recovery assumption. D 

Proof of Theorem[3] 

Theorem [3[ Consider a collection of data points drawn from 
n disjoint subspaces {Si}2^i of dimensions {di}2^i. Let W^ be 

of Y i, where 



r)Dxdi 



|i 2 max COS (^ij) 



(59) 



the set of all full-rank submatrices Yi 
rank(l^^) = di. If the condition 

max ad,{Yi) > ^/d'i\\^ 

holds, then for every nonzero y in Si, the £1 -minimization in ( [53]) 
recovers a subspace-sparse solution, i.e., c* / and cl = OFH 

Proof: We prove the result in two steps. In step 1, we show 
that II a^ 111 < /3i. In step 2, we show that f3-i < ||a_i||i. Then, 
the sufficient condition Pi < P_i establishes the result of the 
theorem, since it implies 



||a,||i<ft </3_, < ||a_,||i, 
i.e., the condition of Theorem [2] holds. 



(60) 



Step 1: Upper bound on the ^i-norm of ( [50] ) Let W^ be the set 
of all submatrices Yi e R^^^^ of Yi that are full column rank. 
We can write the vector x e Si D {^j^iSj) 



x = Yid ^ d = {Y^ Yi)-^Y^ X. 
Using vector and matrix norm properties, we have 



(61) 



l|a||i 



< 



/diWdh = Vdi 



{yJy,)-^yJx\\2 



< Vdi\\{YjY,)-'Yj h,2\\xh 



\x\\2, (62) 



where a^iiYi) denotes the dj-th largest singular value of Yi. 
Thus, for the solution of the optimization problem in (|50](, we 
have 

/di. 



\ai\\i < min ||d||i < min — ■—^— 



\x\\2 = Hi, (63) 



which established the upper bound on the £i-norm of the solution 
of the optimization program in ( [50] ). 

Step 2: Lower bound on the ^i-norm of ( [5T] ) For the solution 
of ( [5T] ) we have x — Y_ia-i. If we multiply both sides of this 
equation from left by x'^ , we get 



— tty tl/ — tty J. — i (Jj — 



(64) 



Applying the Holder's inequality (|i/^i;| < ||i/||oo||'^||i) to the 
above equation, we obtain 



< lli^L 



X 



a_ 



(65) 



11. 



1 1^2 denotes the maximum ^2 -norm of the columns of "K_ 



By recalling the definition of the smallest principal angle between 
two subspaces, we can write 

||x||^ < maxcos(%) ||r-^||i,2 ||a^||2 ||a_,||i, (66) 

where Oij is the first principal angle between Si and Sj and 
||l^_i||i 2 is the maximum ^2 -norm of the columns of 1^_^, i.e., 
data points in all subspaces except Si. We can rewrite ([66]) as 



/?-.^ 



X 2 



maXj^iCos(%) ||y_i||i,2 



<l|a_ 



(67) 



which establishes the lower bound on the ii norm of the solution 
of the optimization program in ( [5T] ). D 



Solving the Sparse Optimization Program 

Note that the proposed convex programs can be solved using 
generic convex solvers such as CV^f^ However, generic solvers 
typically have high computational costs and do not scale well 
with the dimension and the number of data points. 

In this section, we study efficient implementations of the pro- 
posed sparse optimizations using an Alternating Direction Method 
of MultipHers (ADMM) method [SOJ, |i64J. We fist consider the 
most general optimization program 



min ||C||i + Ae||^||i 

iC,E,Z) 



A. 



s.t. 



Y = YC^E 



2 " "" (68) 

Z, C^l = 1, diag(C) = 0, 



and present an ADMM algorithm to solve it. 

First, note that using the equality constraint in ( [68] ), we can 
eliminate Z from the optimization program and equivalently solve 



A. 



min ||C||i+Ae||^||i + - 

(O ,E) Z 



Y -YC 

s.t. C'l = l, diag(C)=0. 



^11 



(69) 



The overall procedure of the ADMM algorithm is to introduce 
appropriate auxiliary variables into the optimization program, 
augment the constraints into the objective function, and iteratively 
minimize the Lagrangian with respect to the primal variables and 
maximize it with respect to the Lagrange multipliers. With an 
abuse of notation, throughout this section, we denote by diag(C) 
both a vector whose elements are the diagonal entries of C and a 
diagonal matrix whose diagonal elements are the diagonal entries 
of C. 

To start, we introduce an auxiliary matrix A G R^^^ and 
consider the optimization program 



min ||C||i + Ae||^||i 

{C,E,A) 



K 



■YA-E\\l 



(70) 



s.t. A^l = l, A = C-diag(C). 

whose solution for (C .,E) coincides with the solution of ( [69] ). 
As we will see shortly, introducing A helps to obtain efficient 
updates on the optimization variables. Next, using a parameter 
p > 0, we add to the objective function of ( [70] ) two penalty terms 



12. CVX is a Matlab-based software for convex programming and can be 
downloaded from ^http://cvxr.com^ 



Algorithm 2 : Solving ([68]) via an ADMM Algorithm 



Initialization: Set maxlter = 10"^, k = 0, and Terminate ^ False. Initialize C'^^\ A'^^\ E'^^\ S'^^\ and A^°^ to zero. 
1: while (Terminate == False) do 
2: update A*^^^^^ by solving the following system of linear equations 

{XzY^Y ^pl^ pll^)A^^+^) = XzY^{Y - E^^^) + p{ll^ + C^^^) - 15^^^^ - aS^\ 

update C(^+^) as 0^^+^^ = J - diag(J), where J = Ti (A^^+^) + A^^Vp)' 

update E^^+^^ as ^(^+^) = rA^(r - rA(^+^)), 

update 5^^+^) as ^^^+1) = 5^^^\ p {A^^^^^^ 1 - 1), 

update A(^+^) as A(^+^) = A^^^ + p(A(^+^) - C^^^^\ 

k ^ k^l, 

if (||A^^)^1- l||oo < e and ||A^^) - C^^^Hoo < e and \\A^^^ - A^^-^^W^o < e and H^^^^ - ^^^-^^Hoo < e or {k > maxlter) 



then 

9: Terminate ^ True 

10: end if 
11: end while 
Output: Optimal sparse coefficient matrix C* = C^^\ 



corresponding to the constraints A^ 1 = 1 and A - 
and consider the following optimization program 



C-diag(C) 



mm 

{C,E,A) 



|C||l + Ae||£;|| 



s.t. 



A'l 






YA 



E\\% 



A^l = l, A 



-(C-diag(C))|| 
diag(C). 



(71) 



Note that adding the penalty terms to ( [70| > do not change its 
optimal solution, i.e., both (JTOji and ( [71] ) have the same solutions, 
since for any feasible solution of ( fTTj i that satisfies the constraints, 
the penalty terms vanish. However, adding the penalty terms 
makes the objective function strictly convex in terms of the 
optimization variables (C, E, A), which allows using the ADMM 
approach. 

Introducing a vector S G K^ and and a matrix A G R^^^ of 
Lagrange multipliers for the two equality constraints in ( |7T] l, we 
can write the Lagrangian function of ( fTTj i as 



E 



C{C,A,E,S,A)= ||C||i+A, 
+ ^\\A^l-m + ^\\A 
+ 5"^(A^l-l) + tr(A"^(A 



A. 



■^11 



Y-YA 

(C-diag(C))||^ 

C + diag(C))), (72) 



M 



where tr(-) denotes the trace operator of a given matrix. The 
ADMM approach then consists of an iterative procedure as 
follows: Denote by {C^^\e^^\A^^^) the optimization variables 
at iteration k, and by {5^^\ A.^^^) the Lagrange multipliers at 
iteration k and 

• Obtain A^^^^^ by minimizing C with respect to A, while 



For large values of N, conjugate gradient methods should 
be employed to solve for A^^^^K 

Obtain C*^^^^^ by minimizing C with respect to C, while 
{A^^\E^^\5^^\A^^^) are fixed. Note that the update on 
C also has a closed-form solution given by 



Ci^+i) = j_diag(J), 



(74) 
(75) 



where 7^(-) is the shrinkage-thresholding operator acting on 
each element of the given matrix, and is defined as 



Tr^(v) = {\v\ -r])^sgn{v). 



(76) 



The operator (•)+ returns its argument if it is non-negative 

and returns zero otherwise. 

Obtain E^ ^^^ by minimizing C with respect to E, while 

also be computed in closed-form as 



E^^^^^ =Tx^{YA^^^^^ -Y), 



(77) 



{c^^\e^^\s 



(fc) a(^)^ 



are fixed. Note that computing the 



derivative of C with respect to A and setting it to zero, we 
obtain 






--X,Y^{Y-E^^^) 
lS^k)T _^ik)^ (73) 



In other words, A^ ^^^ is obtained by solving an A^ x TV 
system of linear equations. When TV is not very large, one 
can simply matrix inversion to obtain A^ ^ ^ from ( [73] ). 



. Having (C^^+^\ A^^^^\ E^^^^^ fixed, perform a gradient 
ascent update with the step size of p on the Lagrange 
multipliers as 

A(^+i) = A^^) + p (A(^+^) - C(^+^)). (79) 

These three steps are repeated until convergence is achieved or 
the number of iterations exceeds a maximum iteration number. 
Convergence is achieved when we have ||A*^^^^1 — l||oo ^ e, 
II A^^) - C^^^lloo < e, II A^^) - A^^-^^lloo < e and \\E^^^ - 
I^llcx) ^ e. where e denotes the error tolerance for the primal 
and dual residuals. In practice, the choice of e = 10~^ works well 
in real experiments. In summary. Algorithm [2] shows the updates 
for the ADMM implementation of the optimization program ( [68] ). 

Computational Time Comparison 

Table [6] shows the computational time of different algorithms on 
the Extended Yale B dataset as a function of the number of 
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TABLE 6 

Average computational time (sec.) of the algorithms on the 
Extended Yale B dataset as a function of the number of subjects. 





LSA 


sec 


LRR 


LRSC 


ssc 


2 Subjects 


5.2 


262.8 


1.6 


1.1 


1.8 


3 Subjects 


13.4 


451.5 


2.2 


1.9 


3.29 


5 Subjects 


62.7 


630.3 


7.6 


5.7 


11.4 


8 Subjects 


180.2 


1020.5 


22.1 


16.3 


42.6 


10 Subjects 


405.3 


1439.8 


255.0 


96.9 


160.3 



subjects. Note that these computational times are based on the 
codes of the algorithms used by their authors. It is important 
to mention that LRR and SSC can be implemented using faster 
optimization solvers. More specifically, LRR can be made faster 
using LADMAP method proposed in: "Z. Lin and R. Liu and 
Z. Su, Linearized Alternating Direction Method with Adaptive 
Penalty for Low-Rank Representation, NIPS 2011." Also, SSC 
can be made faster using LADM method proposed in "J. Yang 
and Y. Zhang. Alternating direction algorithms for £i problems 
in compressive sensing. SIAM J. Scientific Computing, 2010." 



